Confidence Estimation for Automatic Speech Recognition Hypotheses
Abstract
Automatic speech recognition (ASR) systems produce transcriptions for audio which sometimes contain errors. It is useful to know how much confidence may be placed in this output being correct. Confidence estimation is concerned with obtaining scores which quantify this level of confidence. The development and application of a principled, flexible framework using conditional random field (CRF) models for confidence estimation is described. Errors tend to occur over a number of consecutive words in ASR output. This phenomenon is not typically accounted for in confidence estimation, but is exploited here through the sequential nature of the CRF. A custom CRF framework is developed, making it possible for useful feature functions to be engineered. This framework is extended to support hidden-state CRFs. To inform this confidence estimation model, novel predictor features indicative of the quality of ASR hypotheses are proposed, along with a technique for their extraction from lattices. The CRF-based approach is used to combine multiple predictor features and estimate confidence scores for words in ASR hypotheses. This yields performance improvements in the normalised cross entropy (NCE) metric of up to 11.4% relative to a strong baseline (using decision trees). The novel application of a hidden-state CRF to this task yields further relative improvements of up to 17.2%. Estimating confidence scores at the sub-word level is also investigated. Sub-word-level features are combined with word-level features to yield improvements of up to 31.7% relative. The use of a hidden-state CRF for this task yields even larger relative gains of up to 48.6%. The application of CRFs to estimate keyterm confidence scores for spoken term detection is proposed. Discriminative features for keyterm hypotheses are introduced, as well as a model-based approach to keyterm score normalisation. This approach results in improvements of 26% and 36% relative in the miss rate and false alarm rate at operating points of interest.
The novel task of detecting deletions within ASR output is investigated. The sequential nature of the CRF is exploited to make this possible, such that regions in which deletions occur are modelled. Modelling word confidence and deletion regions simultaneously yields an approach which is capable of detecting deletions. Overall, the proposed framework for confidence estimation is shown to yield improved confidence estimates. This is important for downstream applications (e.g. dialogue systems, keyterm detection) which make decisions based on these scores, as well as in-system applications (e.g. data selection and adaptation).
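For readers unfamiliar with the NCE metric quoted in the abstract, the following is a minimal sketch of how normalised cross entropy is commonly computed for word confidence scores. It follows the widely used NIST-style definition and is not taken from the thesis itself; the function name, the clipping constant `eps`, and the toy inputs are assumptions made for illustration.

```python
import math


def normalised_cross_entropy(confidences, is_correct, eps=1e-12):
    """NCE of per-word confidence scores against binary correctness labels.

    Assumed (NIST-style) definition: NCE = (H_max - H_conf) / H_max, where
    H_max is the entropy of tagging every word with the average correctness
    rate and H_conf is the cross entropy of the supplied scores.
    """
    n = len(confidences)
    n_correct = sum(1 for ok in is_correct if ok)
    if n == 0 or n_correct in (0, n):
        raise ValueError("NCE needs at least one correct and one incorrect word")

    p_c = n_correct / n  # prior probability that a word is correct
    h_max = -(n_correct * math.log2(p_c)
              + (n - n_correct) * math.log2(1.0 - p_c))

    h_conf = 0.0
    for c, ok in zip(confidences, is_correct):
        c = min(max(c, eps), 1.0 - eps)  # clip to avoid log(0); eps is illustrative
        h_conf -= math.log2(c if ok else 1.0 - c)

    return (h_max - h_conf) / h_max


# Toy usage: three correct words and one error (hypothetical data).
labels = [True, True, True, False]
nce_prior = normalised_cross_entropy([0.75] * 4, labels)           # scores equal to the prior
nce_good = normalised_cross_entropy([0.9, 0.9, 0.9, 0.1], labels)  # informative scores
```

Under this definition, scores that merely repeat the average correctness rate give an NCE of zero, informative scores give a positive value (at most 1), and misleading scores can give a negative value.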
Similar resources
Driving ROVER with Segment-based ASR Quality Estimation
ROVER is a widely used method to combine the output of multiple automatic speech recognition (ASR) systems. Though effective, the basic approach and its variants suffer from potential drawbacks: i) their results depend on the order in which the hypotheses are used to feed the combination process, ii) when applied to combine long hypotheses, they disregard possible differences in transcription q...
Combining Information Sources for Confidence Estimation with CRF Models
Obtaining accurate confidence measures for automatic speech recognition (ASR) transcriptions is an important task which stands to benefit from the use of multiple information sources. This paper investigates the application of conditional random field (CRF) models as a principled technique for combining multiple features from such sources. A novel method for combining suitably defined features ...
A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation
Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by making a natural man-machine interaction. Since speech is the most popular method of communication, recognizing human emotions from speech signal becomes a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...
GiniSupport vector machines for segmental minimum Bayes risk decoding of continuous speech
We describe the use of Support Vector Machines (SVMs) for continuous speech recognition by incorporating them in Segmental Minimum Bayes Risk decoding. Lattice cutting is used to convert the Automatic Speech Recognition search space into sequences of smaller recognition problems. SVMs are then trained as discriminative models over each of these problems and used in a rescoring framework. We pos...
Automatic quality estimation for ASR system combination
Recognizer Output Voting Error Reduction (ROVER) has been widely used for system combination in automatic speech recognition (ASR). In order to select the most appropriate words to insert at each position in the output transcriptions, some ROVER extensions rely on critical information such as confidence scores and other ASR decoder features. This information, which is not always available, high...
Word Confidence Estimation for Speech Translation
Word Confidence Estimation (WCE) for machine translation (MT) or automatic speech recognition (ASR) consists in judging each word in the (MT or ASR) hypothesis as correct or incorrect by tagging it with an appropriate label. In the past, this task has been treated separately in ASR or MT contexts and we propose here a joint estimation of word confidence for a spoken language translation (SLT) t...
Publication date: 2013